Overview

BioTIP is an R-package designated to detect and characterize biological tipping points in dynamic systems (e.g. time-course or pseudotime gene expression data). The BioTIP R-package can be found on Yang Lab’s GitHub. In particular, installation instructions are in the How to install? section.

In this page, we will present our BioTIP user-friendly wrap function and illustrate to experienced users how to apply BioTIP to processed single-cell experimental data, understand BioTIP’s general workflow and key functions, and obtain graphic results and data. Note that in this guide we frequently save generated data and reload saved data. We do this to save expensive simulation time when generating this web page from R-markdown. We would also like to remind the readers of that our most up-to-date package is available on GitHub.

For readers’ convenience, the following chart serves as navigation as well as an outline of standard workflow in this demonstrative experiment.

Getting Started

Dataset Introduction

The experiment outlined in this page makes use of the single-cell RNA-seq for mouse organogenesis ([GSE87038](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE87038)). In particular, we picked the single-cell RNA-seq collected at 8.25 days post-fertilization (E8.25). This dataset is provided in the R-pacakge MouseGastrulationData, which can be imported by the following R-code:

library(MouseGastrulationData)

Go to Top: Dataset Introduction

Go to Top: Getting Started

Dependency Installation

The following R-code imports required libraries to process data and apply BioTIP analysis.

library(batchelor);library(cluster); library(corrplot); library(dynamicTreeCut); library(edgeR); library(gplots); library(gridExtra); library(igraph); library(limma); library(org.Mm.eg.db); library(pheatmap); library(psych); library(scater); library(scran); library(SingleCellExperiment); library(stringr)
source("BioTIP_update_5_012022020_v2.R")

Go to Top: Dependency Installation

Go to Top: Getting Started

Data Preprocessing

Data Preprocessing, Quality Control, and Normalization

In this demonstration, we selected early organogenesis at embryonic day (E8.25) when mesodermal layers were connected (Pijuan-Sala et al., 2019), which were numbered sample 24, 25, and 28 in AtlasSampleMetadata.

subset(AtlasSampleMetadata, stage=='E8.25')
##    sample stage pool_index seq_batch ncells
## 23     24 E8.25         19         2   6707
## 24     25 E8.25         19         2   7289
## 27     28 E8.25         21         2   4646
sce.24 <- EmbryoAtlasData(samples = 24)
rownames(sce.24) <- uniquifyFeatureNames(rowData(sce.24)$ENSEMBL, rowData(sce.24)$SYMBOL)
sce.25 <- EmbryoAtlasData(samples = 25)
rownames(sce.25) <- uniquifyFeatureNames(rowData(sce.25)$ENSEMBL, rowData(sce.25)$SYMBOL)
sce.28 <- EmbryoAtlasData(samples = 28)
rownames(sce.28) <- uniquifyFeatureNames(rowData(sce.28)$ENSEMBL, rowData(sce.28)$SYMBOL)
table(grepl("^MT-", rownames(sce.24)))
table(grepl("^MT-", rownames(sce.25)))
table(grepl("^MT-", rownames(sce.28)))

The authors of GSE87038 has already performed quality control. Additionally, we remove cells labeled as stripped nuclei or doublets, and check if any cell has high mitochondrial gene content.

drop.24 <- (sce.24$doublet | sce.24$stripped)
drop.25 <- (sce.25$doublet | sce.25$stripped)
drop.28 <- (sce.28$doublet | sce.28$stripped)
sce.24 <- sce.24[,!drop.24]
sce.25 <- sce.25[,!drop.25]
sce.28 <- sce.28[,!drop.28]

In addition, we compute the log-transformed, normalized expression from the count matrices sce.24, sce.25, and sce.28.

sce.24 <- logNormCounts(sce.24)
sce.25 <- logNormCounts(sce.25)
sce.28 <- logNormCounts(sce.28)

Go to Top: Data Preprocessing, Quality Control, and Normalization

Go to Top: Data Preprocessing

Feature Selection

For each selected E8.25 sample, we use the modelGeneVar() function (provided in the R-package scran) to model the variance and decompose the sample into technical and biological components based on a fitted mean-variance trend. The R-objects named dec24, dec25, and dec28 in the following code chunk are the returned dataframe from sce24, sce25, sce28, respectively. Every row of the returned dataframe corresponds to a gene in the original expression matrix. We are only interested in the genes with positive biological component. We denote the set of boolean indicators of whether a gene has positive biological component by dec.expressed.

dec24 <- modelGeneVar(sce.24)
dec25 <- modelGeneVar(sce.25)
dec28 <- modelGeneVar(sce.28)
dec.expressed <- dec24$bio > 0 & dec25$bio>0 & dec28$bio>0
save(dec24, dec25, dec28, dec.expressed, file=paste0("dec.RData"))

Next, we use the multiBatchNorm() function to recompute log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment objects.

sce <- multiBatchNorm(sce.24, sce.25, sce.28)
sce.24 <- sce[[1]]
sce.25 <- sce[[2]]
sce.28 <- sce[[3]]

Go to Top: Feature Selection

Go to Top: Data Preprocessing

Dimension Reduction

We make use ofUniform Manifold Approximation and Projection (UMAP) method for dimension reduction. The function runUMAP(), provided in the package scater, takes in a SingleCellExperiment object and returns a modified object that contains the t-SNE coordinates in the reduced dimension.

set.seed(1111001)
sce.24 <- runUMAP(sce.24, dimred="pca.corrected")
save(sce.24, file=paste0("sce.24.UMAP.RData"))
set.seed(1111001)
sce.25 <- runUMAP(sce.25, dimred="pca.corrected")
save(sce.25, file=paste0("sce.25.UMAP.RData"))
set.seed(1111001)
sce.28 <- runUMAP(sce.28, dimred="pca.corrected")
save(sce.28, file=paste0("sce.28.UMAP.RData"))
load("sce.24.UMAP.RData")
load("sce.25.UMAP.RData")
load("sce.28.UMAP.RData")

Go to Top: Dimension Reduction

Go to Top: Data Preprocessing

Batch Effect Diagnosis

In this subsection, we perform principal components analysis (PCA) on the combined dataframe of sce.24, sce.25, and sce.28. The function runPCA() performs dimension reduction based on the subset of genes with positive biological component via randomized singular value decomposition algorithm (SVD) (set by RandomParam()).

The following dataframe corrected refers to the merged sce.24, sce.25, and sce.28 after batch effect correction using [MNN batch effect correction method](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6152897/).

load("dec.RData")
corrected <- cbind(sce.24,sce.25,sce.28)
set.seed(0010101010)
corrected <- runPCA(corrected, subset_row=dec.expressed, BSPARAM=BiocSingular::RandomParam())
corrected$sample <- factor(corrected$sample)
corrected$batch = rep(1, ncol(corrected))
corrected$batch[which(corrected$sample==25)] <- 2
corrected$batch[which(corrected$sample==28)] <- 3
corrected$batch <- factor(corrected$batch)
save(corrected, file=paste0("corrected.RData"))

We have completed processing to obtain our desired dataframe and are ready to move on to the next section - cell clustering. The generated dataframe by this section should look like:

load("corrected.RData")
corrected
## class: SingleCellExperiment 
## dim: 29452 15935 
## metadata(0):
## assays(2): counts logcounts
## rownames(29452): Xkr4 Gm1992 ... Vmn2r122 CAAA01147332.1
## rowData names(2): ENSEMBL SYMBOL
## colnames(15935): cell_63220 cell_63221 ... cell_95725 cell_95726
## colData names(18): cell barcode ... sizeFactor batch
## reducedDimNames(4): pca.corrected umap UMAP PCA
## altExpNames(0):

Its corresponding PCA plot is as following:

plotReducedDim(corrected, "pca.corrected", colour_by="sample")+
  scale_colour_manual(values = c("green", "blue", "orange"))

Go to Top: Batch Effect Diagnosis

Go to Top: Data Preprocessing

Cell Clustering (Standard Pipeline in Preparation for BioTIP Analysis Inputs)

In this demonstration, we clustered cells using a k-nearest neighbor graph method. More examples using other popular clustering methods (consensus, Leiden) are given in the xxxxx. We showed in these examples that BioTIP works equally well on cell clusters generated by those methods. BioTIP is also applicable to the results of soft-thresholding clusters when considering transition cells as a unique cluster. However, it is essential to tune an optimal parameter setting to generate a reasonable number of cellular clusters. After removing cells that were annotated as putative doublets or cytoplasm-stripped nuclei, we focus on 12 annotated cell types covering the developmental trajectory from somitic mesoderm to endothelium or blood progenitors, which covers the haemato-endothelial progenitor (HEP) branching we are interested. We got 7,240 E8.25 cells of these 12 cell types.

int <- c('Somitic mesoderm','Intermediate mesoderm','ExE mesoderm','Paraxial mesoderm','Allantois','Pharyngeal mesoderm','Cardiomyocytes', 'Mesenchyme','Haematoendothelial progenitors','Endothelium','Blood progenitors 1','Blood progenitors 2')
sce <- corrected[,corrected$celltype %in% int]
table(sce$stage)
## 
## E8.25 
##  7240

Expressed Genes Selection

The selection of expressed genes follow two criteria:

  1. The gene has positive biological components in each dataset when the overall variance of the log-normalized expression are split into biologically relevant and noise components.

  2. The gene has positive biological components in all combined E8.25 cells while accounting for batch effects and additionally assuming Poisson distribution about the noise.

In the absence of spike-in data, we can make some distributional assumptions about the noise. For example, if we only consider technical noise from library preparation and sequencing, UMI counts typically exhibit near-Poisson variation. This can be used to construct a mean-variance trend in the log-counts with the function modelGeneVarByPoisson(), which is provided by the R-package scran. A plot of the variance of log-expression versus the mean of log-expression is shown below.

set.seed(0010101)
dec.pois <- modelGeneVarByPoisson(sce,  block=sce$batch)

According to the criteria of selecting the expressed genes, we intersect the sets of genes satisfying both criteria listed above. Here, we obtained 10938 expressed genes, as shown below.

load("dec.RData")
expressed.criteria.1 <- rownames(dec.pois)[dec.pois$bio>0]
expressed.criteria.2 <- rownames(sce)[dec24$bio > 0 & dec25$bio>0 & dec28$bio>0]
expressed <- intersect(expressed.criteria.1, expressed.criteria.2)
sce <- sce[expressed.criteria.2,]
length(expressed)
## [1] 10938

Go to Top: Expressed Genes Selection

Go to Top: Cell Clustering (Standard Pipeline in Preparation for BioTIP Analysis Inputs)

Cell Clustering

In the preceding section, we obtained 10938 expressed genes. We will run UMAP only on the selected expressed genes.

sce <- runUMAP(sce, dimred='pca.corrected')
save(sce, file=paste0("sce.hvg.UMAP.RData"))

With our obtained data frame in reduced dimension, we will make k nearest-neighbor graph (in this experiment, we take k=10) using the buildSNNGraph() function. This will return us (coded in the R object g.10) a graph where nodes represent cells, and edges represent connection between nearest neighbors. Then, the function cluster_walktrap() finds densely connected sub-graphs in g.10 via random walks. Short random walks tend to stay in the same densely connected sub-graphs. We will then assign each of the 7240 cells to the cluster membership obtained by the cluster_walktrap().

load("sce.hvg.UMAP.RData")
g.10 <- buildSNNGraph(sce, k=10, use.dimred = 'pca.corrected')
clust.10 <- cluster_walktrap(g.10)$membership
colLabels(sce) <- factor(clust.10)

Go to Top: Cell Clustering

Go to Top: Cell Clustering (Standard Pipeline in Preparation for BioTIP Analysis Inputs)

Optional: Trajectory Analysis

Note that BioTIP can identify tipping points and critical transition states (CTSs) independent of trajectory analysis. This step is optional and shown here for demonstrative purpose only.

After assigning the cells to their respective clusters by cluster_walktrap(), we are ready to perform trajectory analysis.

by.cluster <- aggregateAcrossCells(sce, ids=colData(sce)$label)
centroids.cluster <- reducedDim(by.cluster, "pca.corrected")
dmat.cluster <- as.matrix(dist(centroids.cluster))
set.seed(1000)
trajectory.cluster <- graph.adjacency(dmat.cluster, mode = "undirected", weighted = TRUE)
plot(minimum.spanning.tree(trajectory.cluster))

wilcox.z <- pairwiseWilcox(sce, colLabels(sce), lfc=1, direction="up")
markers.z <- getTopMarkers(wilcox.z$statistics, wilcox.z$pairs, pairwise=FALSE, n=50)
markers.up <- findMarkers(sce, test="wilcox", groups=sce$label, direction="up") 
save(markers.z,markers.up, file='markers.up.RData')
load("markers.up.RData")
for(chosen in list(6,7,13,15)) {
  interesting.up <- markers.up[[chosen]]
  best.set <- interesting.up[interesting.up$Top <= 5,]
  AUCs <- getMarkerEffects(best.set, prefix="AUC")
  pheatmap(AUCs, breaks=seq(-1, 1, length.out=101),
             fontsize = 10, main= paste('S',chosen))
}

The below plots show the expression values in sce for the set of genes in newmarkers. The following 9 genes were chosen based on their biology.

newmarkers <-  unique(c("Etv2",'Cdh5', 'Tal1', 'Nkx2-5','Lmo2', 'Sox2', 'Actc1', 'Kdr', 'T'))
plotExpression(sce, features=newmarkers, x="label", colour_by="label", ncol=3)

par(mar = c(4, 4, 0.5, 0.5))
plotReducedDim(sce, 'UMAP', colour_by="celltype", text_by="celltype", text_size = 3)
plotReducedDim(sce, 'UMAP',colour_by="label", text_by="label", text_size = 3)

Go to Top: Optional: Trajectory Analysis

Go to Top: Cell Clustering (Standard Pipeline in Preparation for BioTIP Analysis Inputs)

BioTIP Analysis

BioTIP Analysis Alternative: BioTIP Wrap Function

The above section demonstrated BioTIP analysis pipeline. We have recently developed and made available a BioTIP.wrap() function, whose original code is available through the hyperlink to our group’s GitHub page. There are two purposes of the BioTIP.wrap() function.

  1. This is developed to make BioTIP more user friendly.

  2. More importantly, BioTIP.wrap() enables iterative calls to estimate the stability of predictions. However, we recommend experts or experienced R-users to perform each step with their own tweaking as needed.

The resulting plots and R.Data files would be stored into the folder named to the parameter subDir.

First, we will set key parameters as followings:

getNetwork.cut.fdr=0.2 # to construct RW network and extract co-expressed gene modules
getTopMCI.gene.minsize=60 # minimum number of genes in an identified CTS. We found a number between 20 and 60 works rubustly for high-throughput scRNA-seq datasets.
getTopMCI.n.states=5 # a number setting the number of clusters to check for the significance of DNB score. This parameter can be enlarged when inputting with more clusters. We recommend to set this number as the expected number of transitions states +1/ 
source("BioTIP.wrap.R")
load("sce.hvg.UMAP.RData")
n.getTopHVGs = 4000
libsf.var <- getTopHVGs(dec.pois, n=n.getTopHVGs)
libsf.var <- intersect(libsf.var, rownames(sce))
sce <- sce[libsf.var,]
BioTIP.wrap(sce, samplesL, subDir = 'BioTIP.wrap', getTopMCI.gene.minsize=getTopMCI.gene.minsize, getTopMCI.n.states=getTopMCI.n.states, getNetwork.cut.fdr=getNetwork.cut.fdr)

BioTIP Main Functions Walk Through

We start by creating a list (samplesL) consisting of the 19 clusters of cells.

samplesL <- split(rownames(colData(sce)),f = colData(sce)$label)

Transcript Pre-selection

First, we select the top 4000 highly variable genes (HVGs) and log-transform the data.

hvg <- getTopHVGs(dec.pois, n=4000)
hvg <- intersect(hvg, rownames(sce))
dat <- sce[hvg,]
logmat <- as.matrix(logcounts(dat))

Then, we will use BioTIP function optimize.sd_selection() to pre-select transcript. The optimize.sd_selection() function optimally filters a multi-state dataset based on a cutoff value (given by cut.preselect) for standard deviation per state. A list of dataframe of filtered transcripts with the highest standard deviations are selected. For this experiment, we set cut.preselect to be 0.1, under which the function will select top 10% transcripts.

cut.preselect = 0.1
set.seed(2020)
testres <- optimize.sd_selection(logmat[,unlist(samplesL)], samplesL, B=100, cutoff=cut.preselect, times=.75, percent=0.8)
save(testres, file=paste0("optimized_test_sd_selection_E8.25.RData"), compress=TRUE)

Go to Top: Transcript Pre-selection

Go to Top: BioTIP Analysis

Network Partition

The function getNetwork() will use the Pearson Correlation Coefficient (PCC) analysis to assemble a correlation network of nodes (e.g., co-expressed transcripts) for each state using the R package igraph. getNetwork() will return a list (igraphL) of igraph objects, each being a network of correlated nodes whose PCCs meet the significant criteria based on the false discovery rate (FDR) control (set by cut.fdr to be 0.2). Then, for each state, the function getCluster_methods() splits the correlation generated network into several sub-networks, or modules.

cut.fdr = 0.2
igraphL <- getNetwork(testres, fdr = cut.fdr)
cluster <- getCluster_methods(igraphL)

Go to Top: Network Partition

Go to Top: BioTIP Analysis

Putative Critical Transition Signals (CTSs) Identification by MCI score

After obtaining the modules, we pass the network and its modules to the function getMCI(), which calculates a module critical index (MCI) score for each module per state.

membersL <- getMCI(cluster, testres, adjust.size = FALSE, fun='BioTIP')
save(membersL, file=paste0("membersL.RData"))

The following plot shows MCI scores of the classified cluster samples at 19 different stages.

load("optimized_test_sd_selection_E8.25.RData")
load("membersL.RData")
cut.minsize = 60
par(mar=c(2,2,2,2))
plotBar_MCI(membersL, ylim=c(0,5), minsize = cut.minsize)

The R-object, topMCI, generated by the BioTIP function getTopMCI(), contains a list of the n (here we set n=11) top MCI scores. maxMCIms contains the modules with the maximum MCI scores for each state reported by getMaxMCImember(). Then, maxMCI, generated by getMaxStats(), contains the maximum MCI score for each cluster.

topMCI = getTopMCI(membersL[["members"]], membersL[["MCI"]], membersL[["MCI"]], min=cut.minsize, n=length(membersL))
maxMCIms <- getMaxMCImember(membersL[["members"]], membersL[["MCI"]], min=cut.minsize, n=2)
maxMCI = getMaxStats(membersL[['MCI']], maxMCIms[['idx']])

Then, the following commands use getCTS() to capture the unique IDs and symbols of the network nodes in each clusters with top MCI scores. We drop insignificant CTSs (with MCI score less than MCIbottom).

CTS.Lib = getCTS(maxMCI[names(topMCI)], maxMCIms[["members"]][names(topMCI)])

tmp <- unlist(lapply(maxMCIms[['idx']][names(topMCI)], length))
(whoistop2nd <- names(tmp[tmp==2]))
if(length(whoistop2nd)>0) CTS.Lib = append(CTS.Lib, maxMCIms[["2topest.members"]][whoistop2nd])

maxMCI <- maxMCI[names(CTS.Lib)[1:length(maxMCI)]]
maxMCI <- c(maxMCI, getNextMaxStats(membersL[['MCI']], idL=maxMCIms[['idx']], whoistop2nd))

CTS.Lib.Symbol <- CTS.Lib
for (i in 1:length(CTS.Lib)) {
  CTS.Lib[[i]] <- rowData(sce)[CTS.Lib.Symbol[[i]],]$ENSEMBL
}
MCIbottom = 2
CTS.Lib <- CTS.Lib[1:length(which(topMCI >= MCIbottom))]
CTS.Lib.Symbol <- CTS.Lib.Symbol[1:length(which(topMCI >= MCIbottom))]
maxMCI <- maxMCI[1:length(which(topMCI >= MCIbottom))]
save(CTS.Lib, CTS.Lib.Symbol, maxMCI, file=paste0("CTS_Lib_E8.25.RData"))

The following output CTS.Lib.Symbol has clusters ordered in descending MCI score.

load("CTS_Lib_E8.25.RData")
names(CTS.Lib.Symbol)
## [1] "15" "6"  "16" "13"

cor.shrink() function calculates the average pairwise correlation between genes (MARGIN=1). It returns a pairwise correlation matrix M. Saving this reusable M object will speed up calculation.

M <- cor.shrink(logmat[,unlist(samplesL)], Y = NULL, MARGIN = 1, shrink = TRUE)
save(M, file=paste0("CTS_ShrinkM_E8.25.RData"), compress=TRUE)

The following two code chunks simulate and visualize Dynamic Network Biomarker (DNB) scores. simulationMCI() gets the DNB scores for randomly selected transcript ids. plot_MCI_Simulation() visualized the DNB scores obtained by the previous step.

C = 100
simuMCI = list()
set.seed(2020)
for (i in 1:length(CTS.Lib)){
     n <- length(CTS.Lib[[i]])
     simuMCI[[i]] <- simulationMCI(n, samplesL, logmat,  B=C, fun="BioTIP", M=M)
}
save(simuMCI, file=paste0("SimuMCI_",C,"_E8.25", cut.preselect, "_fdr",cut.fdr,"_minsize",cut.minsize,".RData"))
load("CTS_ShrinkM_E8.25.RData")
load("SimuMCI_100_E8.250.1_fdr0.2_minsize60.RData")
for (i in 1:length(CTS.Lib)){
   plot_MCI_Simulation(maxMCI[i], simuMCI[[i]], las=2, 
                       main=paste0("Cluster ", names(maxMCI)[i], "; ",
                                   length(CTS.Lib[[i]]), " genes", "\n","vs. ", 
                                   "100 times of gene-permutation"),
                       which2point=names(maxMCI)[i])
}

Go to Top: Putative Critical Transition Signals (CTSs) Identification by MCI score

Go to Top: BioTIP Analysis

Tipping-Points Identification and CTSs Evaluation

In this section, we use our Ic.shrink score to evaluate our identified CTSs. In the folloinwg plots, Ic* refers to the Ic.shrink score.

C= 100
BioTIP_scores <- SimResults_g <- list()

set.seed(101010)
for(i in 1:length(CTS.Lib)){
     CTS <- CTS.Lib.Symbol[[i]]
     n <- length(CTS)
     BioTIP_scores[[i]] <- getIc(logmat[,unlist(samplesL)], samplesL, CTS, fun="BioTIP", shrink=TRUE, PCC_sample.target = 'none' )
     SimResults_g[[i]]  <- simulation_Ic(n, samplesL, logmat, B=C, fun="BioTIP", shrink=TRUE, PCC_sample.target = 'none')
}
names(BioTIP_scores) <- names(SimResults_g) <- names(CTS.Lib)
save(SimResults_g, BioTIP_scores, file=paste0("LibSF_IC_sim_BioTIP_E8.25_t4.RData"), compress=TRUE)
load("LibSF_IC_sim_BioTIP_E8.25_t4.RData")
ylim = 1
par(mfrow=c(2,2)) 
for(i in 1:length(BioTIP_scores)){
    n = length(CTS.Lib[[i]])
    interesting = which(names(samplesL) == names(BioTIP_scores[i]))
    p = length(which(SimResults_g[[i]][interesting,] >= BioTIP_scores[[i]][names(BioTIP_scores)[i]]))
    p = p/ncol(SimResults_g[[i]])
    p2 = length(which(SimResults_g[[i]] >= BioTIP_scores[[i]][names(BioTIP_scores)[i]]))
    p2 = p2/ncol(SimResults_g[[i]])
    p2 = p2/nrow(SimResults_g[[i]])
}
  ylim=1
  par(mfrow=c(4,2)) 
  for(i in 1:length(BioTIP_scores)){
    n <- length(CTS.Lib[[i]])
    interesting = which(names(samplesL) == names(BioTIP_scores[i]))
    plot_Ic_Simulation(BioTIP_scores[[i]], SimResults_g[[i]], las = 2, ylab="Ic*", ylim=c(0,ylim),
                       
                       main=paste("Cluster ",names(CTS.Lib)[i],"_",n, "genes", "\n","vs. ", 
                                  "100 gene-permutations"),
                       fun="matplot", which2point= interesting)
    plot_SS_Simulation(BioTIP_scores[[i]], SimResults_g[[i]], 
                       main = paste("Delta Ic*",n,"genes"), ylab=NULL, 
                       xlim=range(c(BioTIP_scores[[i]][names(BioTIP_scores)[i]],
                                    SimResults_g[[i]])))
  }

Go to Top: Tipping-Points Identification and CTSs Evaluation

Go to Top: BioTIP Analysis

CTS Evaluation using Independent Dataset: Ibarra-Soria 2018

In the previous section, we have obtained BioTIP’s identified CTSs. In the section, we will evaluate our obtained CTSs using the related but independent data set Ibarra-Soria 2018.

load("IbarraSoria.RData")
logmat <- as.matrix(logcounts(sce))

Among 33 pre-defined cell subtypes, we focus on the cells of 16 developing mesoderm subtypes:

pseudoOrder <- c('extraembryonicMesoderm', 'endothelial.a', 'endothelial.b', 'endothelial.c', 'endothelial.d','blood',  'mesodermProgenitors', 'presomiticMesoderm.b' , 'presomiticMesoderm.a', 'somiticMesoderm', 'mixedMesoderm.a', 'pharyngealMesoderm', 'mixedMesoderm.b', 'cardiac.a', 'cardiac.b', 'cardiac.c')
samplesL <- split(rownames(colData(sce)),f = colData(sce)$subcelltype)
samplesL <- samplesL[pseudoOrder]
save(logmat, samplesL, file=paste0("samplesL_IbarraSoria_pseudoOrder.RData"))
load("samplesL_IbarraSoria_pseudoOrder.RData")

The following step formats gene alias with official symbols.

CTS.Lib.Symbol[[1]][which(CTS.Lib.Symbol[[1]] == 'Sept1')] = 'Septin1'
CTS.Lib.Symbol[[1]][which(CTS.Lib.Symbol[[1]] == 'Grasp')] = 'Tamalin'
CTS.Lib.Symbol[[1]][which(CTS.Lib.Symbol[[1]] == 'Fam57b')] = 'Tlcd3b'
CTS.Lib.Symbol[[3]][which(CTS.Lib.Symbol[[3]] == 'Sept4')] = 'Septin4'
CTS.Lib.Symbol[[3]][which(CTS.Lib.Symbol[[3]] == 'Fam69a')] = 'Dipk1a'
CTS.Lib.Symbol[[3]][which(CTS.Lib.Symbol[[3]] == 'Sdpr')] = 'Cavin2' 
CTS.Lib.Symbol[[3]][which(CTS.Lib.Symbol[[3]] == 'Wisp1')] = 'Ccn4'
CTS.Lib.Symbol[[4]][which(CTS.Lib.Symbol[[4]] == 'Cxx1b')] = 'Rtl8b'
for(i in 1:4)
{
  CTS <-intersect(CTS.Lib.Symbol[[i]], rownames(logmat))
  BioTIP_scores <- getIc(logmat[,unlist(samplesL)], samplesL, CTS, fun="BioTIP", shrink=TRUE, PCC_sample.target='none')
  state.in.order = names(samplesL)
  n <- length(BioTIP_scores)
  par(mar=c(12,5,5,5))
  plot(BioTIP_scores, type='b',xaxt = "n", main=paste0('CTS: E8.25 S', names(CTS.Lib.Symbol)[i]))
  axis(side=1, at=1:n, labels=state.in.order,cex.axis=1, las=2)
}

C = 100
for(j in c(1,2,3,4)){ 
    CTS <- intersect(CTS.Lib.Symbol[[j]],  rownames(logmat))
    n = length(CTS) 
    SimResults_Ic <- simulation_Ic(n, samplesL, logmat, B=C, fun="BioTIP", shrink=TRUE)
    Ic <- getIc(logmat, sampleL = samplesL, genes = CTS, output = 'Ic', fun = "BioTIP", shrink = TRUE)  
    save(Ic, SimResults_Ic, file=paste0('IC_sim_S',names(CTS.Lib.Symbol)[j],'_CTS',n,'_PermutateGene.in.IbarraSoria2018_PCCsNoShrink.RData'), compress=TRUE)
}
for(j in c(1,2,3,4)){
  CTS <- intersect(CTS.Lib.Symbol[[j]],  rownames(logmat))
  n = length(CTS)
  load(file=paste0('IC_sim_S',names(CTS.Lib.Symbol)[j],'_CTS',n,'_PermutateGene.in.IbarraSoria2018_PCCsNoShrink.RData'))
  rownames(SimResults_Ic) <- names(Ic)
  Ic <- Ic[pseudoOrder]
  SimResults_Ic <- SimResults_Ic[pseudoOrder,]
  
  par(mar=c(12,5,5,5))
  boxplot(t(SimResults_Ic), ylim=c(0,0.5), xlab='states', ylab='Ic*', las=2, main=paste('E8.25 S', names(CTS.Lib.Symbol)[j], n, 'genes'))
  lines(1:length(Ic), Ic, col='red', xaxt = "n")

  M1 <- apply(SimResults_Ic, 1, min)
  M2 <- apply(SimResults_Ic, 1, max)
  cut <- M1+2*(M2-M1)
  lines(1:length(Ic), cut, col='grey', xaxt = "n", lty=3, lwd=2)
  
  plot_SS_Simulation(Ic,  SimResults_Ic, main = paste("Delta of Ic*"), ylab='control') 
}

Go to Top: CTS Evaluation using Independent Dataset: Ibarra-Soria 2018

Session Info

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] stringr_1.4.0               scran_1.18.7               
##  [3] scater_1.18.6               ggplot2_3.3.5              
##  [5] psych_2.1.9                 pheatmap_1.0.12            
##  [7] org.Mm.eg.db_3.12.0         AnnotationDbi_1.52.0       
##  [9] igraph_1.2.11               gridExtra_2.3              
## [11] gplots_3.1.1                edgeR_3.32.1               
## [13] limma_3.46.0                dynamicTreeCut_1.63-1      
## [15] corrplot_0.92               cluster_2.1.2              
## [17] batchelor_1.6.3             MouseGastrulationData_1.4.0
## [19] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
## [21] Biobase_2.50.0              GenomicRanges_1.42.0       
## [23] GenomeInfoDb_1.26.7         IRanges_2.24.1             
## [25] S4Vectors_0.28.1            BiocGenerics_0.36.1        
## [27] MatrixGenerics_1.2.1        matrixStats_0.61.0         
## 
## loaded via a namespace (and not attached):
##   [1] ggbeeswarm_0.6.0              colorspace_2.0-2             
##   [3] ellipsis_0.3.2                scuttle_1.0.4                
##   [5] bluster_1.0.0                 XVector_0.30.0               
##   [7] BiocNeighbors_1.8.2           farver_2.1.0                 
##   [9] bit64_4.0.5                   interactiveDisplayBase_1.28.0
##  [11] fansi_1.0.2                   sparseMatrixStats_1.2.1      
##  [13] mnormt_2.0.2                  cachem_1.0.6                 
##  [15] knitr_1.37                    jsonlite_1.7.3               
##  [17] ResidualMatrix_1.0.0          dbplyr_2.1.1                 
##  [19] shiny_1.7.1                   BiocManager_1.30.16          
##  [21] compiler_4.0.5                httr_1.4.2                   
##  [23] dqrng_0.3.0                   assertthat_0.2.1             
##  [25] Matrix_1.4-0                  fastmap_1.1.0                
##  [27] later_1.3.0                   BiocSingular_1.6.0           
##  [29] htmltools_0.5.2               tools_4.0.5                  
##  [31] rsvd_1.0.5                    gtable_0.3.0                 
##  [33] glue_1.6.1                    GenomeInfoDbData_1.2.4       
##  [35] dplyr_1.0.7                   rappdirs_0.3.3               
##  [37] Rcpp_1.0.7                    jquerylib_0.1.4              
##  [39] vctrs_0.3.8                   nlme_3.1-155                 
##  [41] ExperimentHub_1.16.1          DelayedMatrixStats_1.12.3    
##  [43] xfun_0.29                     beachmat_2.6.4               
##  [45] mime_0.12                     lifecycle_1.0.1              
##  [47] irlba_2.3.5                   gtools_3.9.2                 
##  [49] statmod_1.4.36                AnnotationHub_2.22.1         
##  [51] zlibbioc_1.36.0               scales_1.1.1                 
##  [53] promises_1.2.0.1              RColorBrewer_1.1-2           
##  [55] yaml_2.2.1                    curl_4.3.2                   
##  [57] memoise_2.0.1                 sass_0.4.0                   
##  [59] stringi_1.7.6                 RSQLite_2.2.9                
##  [61] highr_0.9                     BiocVersion_3.12.0           
##  [63] caTools_1.18.2                BiocParallel_1.24.1          
##  [65] rlang_0.4.12                  pkgconfig_2.0.3              
##  [67] bitops_1.0-7                  evaluate_0.14                
##  [69] lattice_0.20-45               purrr_0.3.4                  
##  [71] labeling_0.4.2                cowplot_1.1.1                
##  [73] bit_4.0.4                     tidyselect_1.1.1             
##  [75] magrittr_2.0.1                R6_2.5.1                     
##  [77] generics_0.1.1                DelayedArray_0.16.3          
##  [79] DBI_1.1.2                     pillar_1.6.4                 
##  [81] withr_2.4.3                   RCurl_1.98-1.5               
##  [83] tibble_3.1.6                  crayon_1.4.2                 
##  [85] KernSmooth_2.23-20            utf8_1.2.2                   
##  [87] tmvnsim_1.0-2                 BiocFileCache_1.14.0         
##  [89] rmarkdown_2.11                viridis_0.6.2                
##  [91] locfit_1.5-9.4                grid_4.0.5                   
##  [93] blob_1.2.2                    digest_0.6.29                
##  [95] xtable_1.8-4                  httpuv_1.6.5                 
##  [97] munsell_0.5.0                 viridisLite_0.4.0            
##  [99] beeswarm_0.4.0                vipor_0.4.5                  
## [101] bslib_0.3.1